Method and system for reducing lexical ambiguity

ABSTRACT

A method and system for reducing lexical ambiguity in an input stream are described. In one embodiment, the input stream is broken into tokens. The tokens are used to create a connection graph comprising a number of paths. Each of the paths is assigned a cost. At least one best path is defined based upon a corresponding cost to generate an output graph. The generated output graph is provided to reduce lexical ambiguity.

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] The present invention relates to language translation systems. More particularly, the present invention relates to a method for reducing lexical ambiguity.

[0003] 2. Background Information

[0004] With the continuing growth of multinational business dealings, where the global economy brings together business people of all nationalities, and with the ease and frequency of today's travel between countries, the demand for a machine-aided interpersonal communication system that provides accurate, near real-time language translation, whether in spoken or written form, is compelling. Such a system would relieve users of the need to possess specialized linguistic or translational knowledge.

[0005] A typical language translation system functions by using natural language processing. Natural language processing is generally concerned with the attempt to recognize a large pattern or sentence by decomposing it into small subpatterns according to linguistic rules. A natural language processing system uses considerable knowledge about the structure of the language, including what the words are, how words combine to form sentences, what the words mean, and how word meanings contribute to sentence meanings. However, linguistic behavior cannot be completely accounted for without also taking into account another aspect of what makes humans intelligent: their general world knowledge and their reasoning abilities. For example, to answer questions, to participate in a conversation, or to create and understand written language, a person not only must have knowledge about the structure of the language being used, but also must know about the world in general and the conversational setting in particular. Specifically, phonetic and phonological knowledge concerns how words are related to the sounds that realize them. Morphological knowledge concerns how words are constructed from more basic units called morphemes. Syntactic knowledge concerns how words can be put together to form correct sentences and determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases. Typical syntactic representations of language are based on the notion of context-free grammars, which represent sentence structure in terms of what phrases are subparts of other phrases. This syntactic information is often presented in a tree form. Semantic knowledge concerns what words mean and how these meanings combine in sentences to form sentence meanings. This is the study of context-independent meaning, that is, the meaning a sentence has regardless of the context in which it is used. The representation of the context-independent meaning of a sentence is called its logical form. The logical form encodes possible word senses and identifies the semantic relationships between the words and phrases.

[0006] Natural language processing systems further include interpretation processes that map from one representation to another. For instance, the process that maps a sentence to its syntactic structure and logical form is called parsing, and it is performed by a component called a parser. The parser uses knowledge about words and word meanings (the lexicon) and a set of rules defining the legal structures (the grammar) in order to assign a syntactic structure and a logical form to an input sentence.

[0007] Formally, a context-free grammar of a language is a four-tuple comprising a nonterminal vocabulary, a terminal vocabulary, a finite set of production rules, and a starting symbol for all productions. The nonterminal and terminal vocabularies are disjoint. The set of terminal symbols is called the vocabulary of the language. Pragmatic knowledge concerns how sentences are used in different situations and how use affects the interpretation of the sentence.
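In the standard textbook notation (supplied here for convenience; the notation itself is not part of the original text), such a four-tuple may be written as:

$$G = (V_N,\; V_T,\; P,\; S), \qquad V_N \cap V_T = \emptyset,$$

where $V_N$ is the nonterminal vocabulary, $V_T$ is the terminal vocabulary (the vocabulary of the language), $P$ is the finite set of production rules, and $S \in V_N$ is the starting symbol for all productions.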

[0008] A natural language processor receives an input sentence, lexically separates the words in the sentence, syntactically determines the types of words, semantically understands the words, pragmatically determines the type of response to generate, and generates the response. The natural language processor employs many types of knowledge and stores different types of knowledge in different knowledge structures that separate the knowledge into organized types.

[0009] The complexity of the natural language process is increased due to lexical ambiguity of input sentences. Cases of lexical ambiguity may hinge on the fact that a particular word has more than one meaning. For example, the word bank can be used to denote either a place where monetary exchange and handling takes place or the land close to a river, the bank of the river. A word or a small group of words may also have two or more related meanings. That is, the adjective bright may be used as a synonym for “shining” (e.g., “The stars are bright tonight”) or as a synonym for “smart” (e.g., “She must be very bright if she made an ‘A’ on the test”). In the field of spoken language translation, the problem is compounded by words that are not necessarily spelled the same but are pronounced the same and have different meanings. For example, the words night and knight are pronounced exactly the same although they are spelled differently, and they have very different meanings.

[0010] Factors causing lexical ambiguity vary from one language to another. In character-based languages, e.g., the Japanese language, extracting information from an input sentence creates a serious problem because Japanese sentences do not have spaces between words. Part-of-speech (POS) tags are another factor causing lexical ambiguity. In many languages, including both word-based and character-based natural languages, one word may have more than one POS tag depending on the context of the word within the sentence. The word table, for example, can be a verb in some contexts (e.g., “He will table the motion”) and a noun in others (e.g., “The table is ready”). The existence of multiword expressions in many languages, including the English language, is yet another factor contributing to lexical ambiguity. That is, depending on the context, a group of words, such as “white house”, can be treated as a multiword expression (e.g., “I want to visit the White House”) or as separate words (e.g., “He lives in a white house across the street”).

[0011] One current approach that deals with lexical ambiguity in a Japanese input sentence involves treating each Japanese character as a word and letting the parser group the characters using the parsing grammar. After the parser defines the words, the parser must try all POS tags found for each word and rule out the impossible tags. As a result, the parsing program is time consuming and requires a large amount of space for its operation. If a long or complicated sentence is involved, such a parser may not be able to perform the parsing at all.

[0012] Another current approach to dealing with lexical ambiguity recognizes all the possible words in a Japanese sentence and then finds possible connections between adjacent words. The recognition of all the words is done using a morpheme dictionary. The morpheme dictionary defines Japanese morphemes with the names of POS tags. The connectivity is defined using a connection-pair grammar. The connection-pair grammar defines pairs of sets of morphemes that may occur adjacently in a sentence. Various costs are then applied to the morphemes to compare all possible segmentations of the input sentence. These various costs correspond to the likelihood of observing a word as a certain part of speech and to the likelihood of observing two words in adjacent positions. In this approach, the segmentation that has the lowest corresponding cost is selected from all the possible segmentations of the input sentence for further processing. However, the segmentation selected based upon the lowest costs may not correspond to the correct meaning of the input sentence. Since the syntactic parser is better equipped to recognize the correct meaning of the input sentence, making a selection before the parsing operation may result in loss of pertinent information. Consequently, this approach may lead to inaccurate results in producing a response to an input sentence, especially a longer or more complicated sentence. The techniques currently used to deal with lexical ambiguity in an English sentence have problems similar to those identified above. Unlike Japanese sentences, English sentences do not need to be segmented, as the individual words form the segments. However, multiple POS tags of a word present the same problem for English sentences as they do for Japanese sentences. As described above, one approach taken to deal with this problem requires the parser to try all POS tags found for each word and rule out the impossible tags. In this approach, the parsing program is very time consuming and requires a large amount of space for its operation. In addition, this approach may not be able to handle long and complicated sentences.

[0013] Another approach analyzes all POS tags for each word in an English input sentence and finds the most likely POS tag for each word using lexical and statistical probabilities. However, some probabilities may be hard to estimate. No matter how much text is analyzed for the estimation, there will always be a large volume of words that appear only a few times. Thus, relying strictly on probabilities may not result in an accurate interpretation, especially in dealing with a long or complex sentence in which a word's meaning is dependent upon the context of the word within the sentence. As explained earlier, since the syntactic parser is better equipped to recognize the correct meaning of the input sentence, making a selection before the parsing operation may result in loss of pertinent information.

[0014] Therefore, what is required is an efficient way of reducing lexical ambiguity which will provide an accurate interpretation of an input sentence without unreasonably burdening the operation of the syntactic parser.

SUMMARY OF THE INVENTION

[0015] A method and system for reducing lexical ambiguity in an input stream are described. In one embodiment, the input stream is broken into tokens. The tokens are used to create a connection graph comprising a number of paths. Each of the paths is assigned a cost. At least one best path is defined based upon a corresponding cost to generate an output graph. The generated output graph is provided to reduce lexical ambiguity.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] The present invention is illustrated by way of example and may be better understood by referring to the following description in conjunction with the accompanying drawings, in which like references indicate similar elements and in which:

[0017] FIG. 1 is a block diagram of one embodiment for an architecture of a computer system;

[0018] FIG. 2a is a block diagram of one embodiment for a natural language translation system;

[0019] FIGS. 2b, 2c, and 2d are exemplary diagrams of structures used by the natural language translation system of FIG. 2a;

[0020] FIG. 3 is a diagram of one embodiment for a lexical ambiguity module;

[0021] FIG. 4 is a flow diagram of one embodiment for reducing lexical ambiguity in a natural language translation system;

[0022] FIG. 5a illustrates an exemplary connection graph;

[0023] FIG. 5b illustrates an exemplary path in a connection graph;

[0024] FIG. 6 is a flow diagram of one embodiment for segmentation of an input stream;

[0025] FIG. 7 is a flow diagram of one embodiment for reducing lexical ambiguity in an input English expression;

[0026] FIG. 8 is a flow diagram of one embodiment for reducing lexical ambiguity in an input Japanese expression;

[0027] FIG. 9 illustrates an exemplary connection of tokens in an input Japanese sentence.

DETAILED DESCRIPTION OF AN EMBODIMENT OF THE PRESENT INVENTION

[0028] A method and system for reducing lexical ambiguity in an input stream are described. In one embodiment, the input stream is broken into tokens. The tokens are used to create a connection graph comprising a number of paths. Each of the paths is assigned a cost. At least one best path is defined based upon a corresponding cost to generate an output graph. The generated output graph is provided to reduce lexical ambiguity.

[0029] In the following detailed description of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.

[0030] Some portions of the detailed descriptions that follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of processing blocks leading to a desired result. The processing blocks are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

[0031] It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

[0032] The present invention also relates to apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.

[0033] The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.

[0034] Lexical ambiguity is a recognized problem in natural language processing. The problem of lexical ambiguity arises when a natural language processor needs to extract information from an input sentence for subsequent syntactic parsing. Extracting information becomes problematic in character-based languages, which do not have separators such as spaces between words in a sentence. In addition, in many languages, a word may have different part-of-speech (POS) tags depending on the context of the word within the sentence. In some languages, certain groups of words can either be treated as multiword expressions or as separate words depending on the context. In one embodiment, the lexical ambiguity reduction module provides a method for reducing lexical ambiguity in an input sentence which improves the segmentation of the input sentence and supports POS tagging and multiword processing. In this embodiment, the lexical ambiguity module produces a graph which is passed to a syntactic analysis module for subsequent processing. In one embodiment, an efficient method of reducing lexical ambiguity is provided which allows the language processing system to produce an accurate interpretation of the input sentence without unreasonably burdening the operation of the syntactic analysis module.

[0035] FIG. 1 is a block diagram of one embodiment for an architecture of a computer system 100. Referring to FIG. 1, computer system 100 includes system bus 101 that allows for communication among processor 102, digital signal processor 108, memory 104, and non-volatile storage device 107. System bus 101 may also receive inputs from keyboard 122, pointing device 123, and speech signal input device 125. System bus 101 provides outputs to display device 121, hard copy device 124, and output device 126 (such as, for example, an audio speaker). Memory 104 may include, for example, read only memory (ROM), random access memory (RAM), flash memory, or any combination of the above.

[0036] It will be appreciated that computer system 100 may be controlled by operating system software which includes a file management system, such as, for example, a disk operating system, which is part of the operating system software. The file management system may be stored in non-volatile storage device 107 and may be configured to cause processor 102 to execute the various functions required by the operating system to input and output data and to store data in memory 104 and on non-volatile storage device 107.

[0037] FIG. 2a is a block diagram of one embodiment for a natural language translation system 200. Referring to FIG. 2a, natural language translation system 200 includes five modules, supporting databases, and associated grammars to quickly and accurately translate text between source and target languages. Input text may be directly input into natural language translation system 200 (for example, as with a person typing sentences into a computer using keyboard 122). Alternatively, input text to natural language translation system 200 may be the output of another system, such as, for example, output from a speech recognition system (for example, speech input device 125), or from an optical character recognition system (not shown).

[0038] An English sentence “He wants to go to the White House” is used throughout this section as example text input to describe the functioning of the system 200. The individual units in a sentence are referred to herein as “words,” but the natural language translation system 200 is not limited to only word-based natural languages, having equal applicability to translation of character-based languages as well. Except where the differences in processing word-based and character-based languages are specified, the term “word” is intended to encompass both words and characters.

[0039] In the following description, a grammar is generally a set of context-free rules that define the valid phrase structures in the source or target language, with each context-free rule associated with one or more statements (the “rule body”) that perform tests and manipulations on the linguistic representations (feature structures). Thus, an English sentence may be combined from a noun phrase (NP) and a verb phrase (VP), but the subject and verb forms must agree; e.g., “He want to go to the White House” is a valid phrase structure but an improper English sentence. All rule bodies utilized by the grammars of language translation system 200 are in the form of computer-executable routines produced by defining the grammar in terms of a grammar programming language (GPL) and passing appropriate rule bodies (209, 215, 219, and 225) through a GPL compiler 240. The output of the GPL compiler 240 may be in the form of directly executable code, or may be in the form of standard computer programming language statements (such as, for example, C, C++, Pascal, or Lisp) which are then input into the corresponding programming language compiler to produce executable code. In either case, the compiled grammars include a specific function for each context-free rule. The specific function performs all the processing required by the rule and its associated rule body. Furthermore, the interfaces between the compiled grammars and the modules enable a single language translation system 200 to perform translation between multiple natural languages and to perform more than one translation simultaneously.

[0040] A morphological analysis module 206 takes text input 202 and uses a source language dictionary 204 to decompose the words into morphemes by identifying root forms, grammatical categories, thesaurus information, and other lexical features of the words. The morphological analysis module 206 builds a “feature structure” for each word. Feature structures are well known in the art as linguistic data structures that contain feature-value pairs for strings, symbols, and numbers that appear in a natural language sentence. Each feature of a word is mapped to the appropriate value through a function commonly designated as word ↦ [feature : value].

[0041] Thus, simplified, exemplary representations of the feature structures for the words “he” and “wants” are as follows:

$$he \mapsto \begin{bmatrix}\text{root : he}\\ \text{cat : pronoun}\end{bmatrix} \qquad (\text{Feature Structure 1})$$

$$wants \mapsto \begin{bmatrix}\begin{bmatrix}\text{root : want}\\ \text{cat : noun}\end{bmatrix}\\ \text{OR}\\ \begin{bmatrix}\text{root : want}\\ \text{cat : verb}\end{bmatrix}\end{bmatrix} \qquad (\text{Feature Structure 2})$$

[0042] Feature Structure 2 may be referred to as a “disjunctive” feature structure, as it represents two mutually exclusive feature structures that are valid for the word. It will be appreciated that the grammatical category is not the only feature of these two words and that morphological analysis module 206 outputs full feature structures. The example feature structures are simplified for the sake of clarity in explanation and are also frequently represented by a shorthand notation, e.g., [want] or NP[].
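For illustration only, the disjunctive feature structures above might be modeled as follows. The lexicon contents, the Disjunction wrapper, and the pos_tags helper are hypothetical conveniences, not the data structures of morphological analysis module 206:

```python
from dataclasses import dataclass

@dataclass
class Disjunction:
    """Mutually exclusive feature structures for one word (e.g., noun OR verb)."""
    alternatives: list

# A toy lexicon mirroring Feature Structures 1 and 2 above.
lexicon = {
    "he":    {"root": "he", "cat": "pronoun"},
    "wants": Disjunction([{"root": "want", "cat": "noun"},
                          {"root": "want", "cat": "verb"}]),
}

def pos_tags(word):
    """Return every grammatical category recorded for a word."""
    entry = lexicon[word]
    alternatives = entry.alternatives if isinstance(entry, Disjunction) else [entry]
    return [fs["cat"] for fs in alternatives]

print(pos_tags("wants"))  # ['noun', 'verb'] -- the POS ambiguity to be reduced
```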

[0043] The feature structures built by morphological analysis module 206 are input into lexical ambiguity reduction module 210. In one embodiment, lexical ambiguity reduction module 210 may segment the words in character-based languages that do not utilize spaces through a database of lexical connector feature rules 208. Lexical connector feature rules 208 are created from GPL grammar statements as described above. Each possible combination of adjacent segmented words is assigned a lexical cost. Dictionary 204 defines combinations of words (“multiwords”). Lexical ambiguity reduction module 210 evaluates each feature structure that contains a part-of-speech (POS) ambiguity, such as the feature structure for the word “wants,” which is tagged as both a noun and a verb. The various possible POS tags are assigned a lexical cost. Lexical ambiguity reduction module 210 weighs the cost assigned to each word in the sentence and selects those feature structures that have the lowest cost.

[0044] The feature structures chosen for the words by lexical ambiguity reduction module 210 are passed to syntactic analysis module 216. Syntactic analysis module 216 combines the chosen feature structures into a feature structure that represents the content of the input sentence. In one embodiment, syntactic analysis module 216 uses parsing grammar 212 to create a syntax parse tree for the sentence. Parsing grammar 212 contains the source language context-free grammar rules in the form of a parsing table and the associated rule bodies in executable code. Each leaf of the syntax parse tree is a feature structure for one of the words in the sentence. Once the leaves are created, an intermediate feature structure for each branch (parent) node in the syntax parse tree is built by combining its child nodes as specified in one or more of the context-free grammar rules. The rule body for each potentially applicable context-free grammar rule manipulates the various feature structures at the child nodes and determines whether the associated context-free rule could create a valid phrase from the possible combinations. A rule body may cause a thesaurus 214 to be queried as part of the manipulation. It will be appreciated that the feature structure that results from applying the context-free grammar rules may be nested (i.e., contain multiple feature structures from each child node). Syntactic analysis module 216 may create the syntax parse tree shown in FIG. 2b for the example sentence from its constituent feature structures, with the following feature structure at the top (root) of the syntax parse tree to represent the full sentence:

$$S \rightarrow \begin{bmatrix}\text{SUBJ : “he”}\\ \text{VERB : “wants to go”}\\ \text{OBJ : “to the White House”}\end{bmatrix} \qquad (\text{Feature Structure 3})$$

[0045] It will be appreciated that both the syntax parse tree 250 and Feature Structure 3 are not exact representations but are simplified for purposes of ease in explanation.

[0046] The feature structure for the sentence in the source language is passed to transfer module 222. The feature structure represents the analysis of the source input and may contain a number of nested linguistic representations (referred to herein as sub-structures or slots). Transfer module 222 uses transfer grammar 218 to match source language slots of the input with source language slots in example database 220. Example database 220 contains feature structure pairs in the source language and a target language. For example, database 220 may contain matching feature structures in English and Japanese. Transfer grammar 218 consists of a set of rewrite rules with a context-free component and a GPL rule body. The context-free parts of the rules are used to create a transfer generation tree.

[0047] Transfer module 222 uses the GPL rule bodies within transfer grammar 218 to match the input source sub-structures or slots to the source sub-structures or slots in example database 220. If a good match is found (in one embodiment, a low overall match cost), transfer module 222 checks whether all sub-structures or slots of the input feature structure have found a match. If a match for a sub-structure is not found, the sub-structure is used as input to transfer module 222. A transfer generation tree of the form shown in FIG. 2c is used to break the sub-structure into multiple sub-structures. The new input may be a part of the original source feature structure or a new feature sub-structure that is constructed from sections of different slots.

[0048] Transfer module 222 uses the input feature structure (or sub-structure) in the source language as the starting symbol to build transfer generation tree 260. Root 261 is a symbol-node (s-node) and is labeled with the starting symbol of the feature structure. The transfer grammar determines which transfer grammar rules are applicable to the feature structure at the root 261 and creates child rule-nodes (r-nodes) 263 depending from root 261. In one embodiment, r-nodes 263 are the rule numbers within transfer grammar 218 that may be validly applied to the input feature structure. The transfer grammar 218 rules added to tree 260 are applied to the s-nodes. If the application of a rule succeeds, a child r-node is added to tree 260. If the application fails, the s-node is tagged as “dead” for subsequent removal. Transfer grammar 218 then creates a new s-node 265 for each r-node 263. Again, the applicable rules are found for each s-node 265 and applied. The process is repeated until all sub-features within the feature structure have been expanded. Transfer generation tree 260 is then pruned to remove any “dead” nodes and corresponding sub-trees. If root 261 is tagged as “dead,” the generation fails. Otherwise, the resulting transfer generation tree 260 is used by transfer module 222 to match the feature structure against the example database 220. The feature structures and sub-structures in the target language associated with a match are substituted for the corresponding feature structures and sub-structures matched in the source language. Transfer module 222 recursively applies the transfer rules contained within transfer grammar 218 from the top-most transfer rules until all meaningful sub-features or constituents within the input source feature structure are transferred to the target sub-structures. The transfer module 222 will consult the thesaurus 214 when required to do so by a transfer rule. Transfer module 222 outputs a feature structure in the target language.
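The alternating expansion of s-nodes and r-nodes described in this paragraph can be sketched as follows. The `applicable_rules`, `apply_rule`, and `is_leaf` calls are hypothetical stand-ins for the compiled GPL rule bodies; this is an illustrative outline, not the transfer module's actual implementation:

```python
class SNode:
    """Symbol node, labeled with a feature structure (or sub-structure)."""
    def __init__(self, fs):
        self.fs = fs
        self.children = []   # child r-nodes
        self.dead = False

class RNode:
    """Rule node, labeled with a transfer grammar rule number."""
    def __init__(self, rule_id):
        self.rule_id = rule_id
        self.children = []   # child s-nodes

def expand(s_node, grammar):
    """Expand an s-node; tag it "dead" if no rule body succeeds on it."""
    succeeded = False
    for rule in grammar.applicable_rules(s_node.fs):
        sub_structures = grammar.apply_rule(rule, s_node.fs)  # run the rule body
        if sub_structures is None:        # application failed for this rule
            continue
        succeeded = True
        r_node = RNode(rule)
        s_node.children.append(r_node)
        for sub in sub_structures:        # one new s-node per sub-structure
            child = SNode(sub)
            r_node.children.append(child)
            if not grammar.is_leaf(sub):  # repeat until fully expanded
                expand(child, grammar)
    if not succeeded:
        s_node.dead = True                # pruned later with its sub-tree
```

After expansion, a pruning pass would remove the “dead” nodes and their sub-trees; if the root itself is dead, the generation fails, as described above.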

[0049] The feature structure for the sentence in the target language is passed to a morphological and syntactical generation module 228, where it is used as the root node for a syntactical generation tree, an example of which is shown in FIG. 2d. The syntactical generation tree is built in the same fashion as the transfer generation tree, with context-free rules in a generation grammar 224 as its r-nodes 273. The generation grammar 224 copies information to each s-node 275, 279. Unlike the transfer module 222, in which multiple sub-transfers created multiple transfer generation trees, only one syntactical generation tree is created by the morphological and syntactical generation module 228. Any s-node that is not a leaf node 279, i.e., associated with a feature structure for a word, is used to generate the next level of r-nodes. When all child s-nodes under an r-node are leaf nodes, the current branch of the tree is complete and the morphological and syntactical generation module 228 traverses back up the tree to find the next s-node that is not a leaf node. The thesaurus 214 is consulted when necessary during the generation of the tree. The syntactical generation tree is complete when all the lowest-level s-nodes are leaf nodes.

[0050] When the syntactical generation tree is complete, the leaf nodes contain output feature structures representing the words in one or more translations of the input sentence. The sequence of output feature structures that represents the best sentence is converted into output text 230 by the morphological and syntactical generation module 228 using the dictionary 226. Alternatively, all output feature structures for all sentences may be converted into the output text 230.

[0051] Lexical ambiguity reduction module 210 of FIG. 2a will now be described in more detail. FIG. 3 is a diagram of one embodiment for lexical ambiguity reduction module 210 of FIG. 2a. Referring to FIG. 3, lexical ambiguity reduction module 210 comprises tokenizer 306, segmentation and POS engine 320, and grammar programming language (GPL) compiler 312. It will be recognized by one skilled in the art that a wide variety of engines other than those discussed above may be used by the lexical ambiguity reduction module without loss of generality.

[0052] In one embodiment, tokenizer 306 receives input string 302 comprising a sequence of words and breaks it into individual tokens 308. A token may comprise, for example, a full word, a reduced word, a number, a symbol, or a punctuation character. In a Japanese sentence, in which there are no spaces between words, each Japanese character may correspond to a token. Tokenizer 306 examines the local context of the word or character within the sentence or phrase, or the current character and its immediate neighbors. Tokenizer 306 may use a small set of tokenization rules 304. In one example of an English language sentence, tokenizer 306 may make a break at the following places, with the corresponding effect (a code sketch of several of these rules follows the list):

[0053] space character (space, return, tab, End-of-Sentence (EOS));

[0054] apostrophe + space character (“Doris'” → “Doris” “'”);

[0055] apostrophe + “s” (“Peter's” → “Peter” “'s”);

[0056] apostrophe + “re” (“they're” → “they” “'re”);

[0057] apostrophe + “d” (“Peter'd” → “Peter” “'d”);

[0058] apostrophe + “ve” (“Peter've” → “Peter” “'ve”);

[0059] apostrophe + “ll” (“Peter'll” → “Peter” “'ll”);

[0060] period + EOS (“Peter likes fish.” → “Peter” “likes” “fish” “.”);

[0061] question mark (“Does Peter like fish?” → “does” “Peter” “like” “fish” “?”);

[0062] exclamation mark (“Fish!” → “fish” “!”);

[0063] comma (except between numbers) (“apples, oranges and bananas” → “apples” “,” “oranges” “and” “bananas”);

[0064] dollar sign (“$30” → “$” “30”);

[0065] percent sign (“30%” → “30” “%”);

[0066] plus sign (“+80” → “+” “80”);

[0067] minus sign (only when followed by a number) (“−3” → “−” “3”);

[0068] semicolon (“fruits; apples, oranges and bananas” → “fruits” “;” “apples” “,” “oranges” “and” “bananas”);

[0069] colon (except between numbers).
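A handful of the break rules listed above might be rendered with a regular expression as in the following sketch. This is a simplification offered for illustration: tokenizer 306 is driven by tokenization rules 304 and examines local context, which a single pattern only approximates:

```python
import re

# Approximates several of the listed rules: splitting off 's/'re/'d/'ve/'ll
# and trailing apostrophes, keeping commas and periods inside numbers, and
# treating punctuation, "$", "%", and "+" as separate tokens.
TOKEN_PATTERN = re.compile(r"""
      's | 're | 'd | 've | 'll    # apostrophe + s/re/d/ve/ll
    | '                            # trailing apostrophe ("Doris'")
    | \d+(?:[.,]\d+)*              # numbers; comma/period between digits kept
    | \w+                          # ordinary words
    | [.?!,;:$%+]                  # punctuation and sign characters
    | -(?=\d)                      # minus sign only when followed by a number
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("Peter's $30 fish, they're 30% cheaper!"))
# ['Peter', "'s", '$', '30', 'fish', ',', 'they', "'re", '30', '%', 'cheaper', '!']
```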

[0070] In one embodiment, segmentation and POS engine 320 receives tokens 308 and performs one or more of its assigned functions, such as, for example, segmentation, POS tagging, and multiword processing. Each of the named functions is described in more detail below.

[0071] During segmentation, segmentation and POS engine 320 makes possible connections between tokens by consulting lexical dictionary 316 and lexical functions 314. In one embodiment, lexical dictionary 316 comprises lexical entries in the format of feature structures. Each lexical entry stored in lexical dictionary 316 may have corresponding POS information. In alternate embodiments, a wide variety of other lexical information may be stored in lexical dictionary 316. In one embodiment, lexical dictionary 316 may also contain a multiword dictionary used in the multiword processing as described below. Alternatively, multiword information may be stored in a separate dictionary.

[0072] Lexical functions 314 represent lexical grammar rules 310. In one embodiment, lexical functions 314 result from pre-compiling lexical grammar rules 310 using GPL compiler 312. In this embodiment, lexical functions related to tokens 308 may be selected from lexical functions 314. Alternatively, lexical grammar rules related to tokens 308 may be selected from lexical grammar rules 310 and compiled by GPL compiler 312 to generate lexical functions related to tokens 308.

[0073] Lexical grammar rules 310 may be written in GPL. In one embodiment, lexical grammar rules 310 comprise Japanese lexical grammar rules. In an alternate embodiment, lexical grammar rules 310 may comprise various grammar rules of any other language and may be represented by a wide variety of other programming languages or structures. The Japanese grammar rules may include rules defining the connectivity relation of tokens.

[0074] In one embodiment, GPL compiler 312 compiles rules selected from lexical grammar rules 310 to generate lexical functions 314. As described above, lexical functions 314 may include calls to feature structure library routines which allow flexibility in developing lexical grammar rules. This flexibility becomes especially important when complex and space-consuming rules are involved, such as lexical connector rules defining the connectivity relation of tokens.

[0075] After defining all possible connections of tokens 308, segmentation and POS engine 320 may perform POS tagging. Alternatively, POS tagging may be performed simultaneously with the segmentation process. In another embodiment, POS tagging may be performed without performing segmentation (for example, in word-based natural languages). Segmentation and POS engine 320 performs POS tagging by consulting lexical dictionary 316 and assigning all possible POS tags to each segmented word of the input sentence 302. In one embodiment, segmentation and POS engine 320 searches lexical dictionary 316 for every segmented word of input sentence 302. As described above, lexical dictionary 316 may comprise lexical entries for words in the format of feature structures. Once the segmented word is found, segmentation and POS engine 320 retrieves all corresponding POS tags contained within the feature structure of this word.

[0076] In one embodiment, multiword processing is also performed to define multiword expressions in the input sentence. Segmentation and POS engine 320 performs multiword processing by consulting a multiword dictionary which may be included in the lexical dictionary 316 or contained in a separate dictionary. The multiword processing is described in more detail below.

[0077] In one embodiment, segmentation and POS engine 320 creates a connection graph comprising a plurality of paths defined by all possible segmentations of input sentence 302 and/or various POS tags assigned to each segmented word in input sentence 302. Multiword expressions may also be reflected in the connection graph. The content of the connection graph and the process of its creation are explained below. Segmentation and POS engine 320 compares the paths in the connection graph. In one embodiment, the comparison is done using lexical cost file 318 which contains various lexical cost information. The information in lexical cost file 318 may include, for example, lexical costs, unigram costs, bigram costs, and connector costs.

[0078] Lexical costs correspond to the probability of observing a certain word as a certain part of speech. For example, the probability of observing the word “bank” as a noun may be higher than the probability of observing the word “bank” as a verb. Unigram costs, or POS costs, correspond to the probability of observing a particular part of speech, regardless of what the particular word is or what the surrounding parts of speech are. For example, the probability of observing a noun within any sentence may be higher than the probability of observing a determiner. Bigram costs correspond to the probability of observing a sequence of two particular parts of speech together, regardless of what the words are. For example, the probability of observing a determiner followed by a noun may be higher than the probability of observing a noun followed by a determiner. Connector costs correspond to the probability of observing two particular words in adjacent positions. Consider a Japanese sentence in which two different words, word 1 and word 2, may be created starting from a certain position depending on their lengths. Say that word 1 is created by combining six characters and word 2 is created by combining eight characters, namely the same six characters plus the two characters immediately following them. Word 3 in this example is a word which ends immediately before word 1 and word 2 start. Here, the connector costs may reflect that the probability of observing word 3 in a position adjacent to word 2 may be higher than the probability of observing word 3 adjacent to word 1, or vice versa.
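Although the disclosure gives no explicit formula, these four quantities combine naturally if each cost is taken to behave like a negative log-probability that accumulates along a path. Under that assumption, for a path $p$ whose arcs $a_1, \ldots, a_n$ carry words $w_i$ and POS tags $t_i$, and whose interior nodes are $v_1, \ldots, v_{n-1}$, the path cost might be written as:

$$\mathrm{cost}(p) = \sum_{i=1}^{n} \big( \mathrm{lex}(w_i, t_i) + \mathrm{uni}(t_i) \big) + \sum_{i=1}^{n-1} \mathrm{bi}(t_i, t_{i+1}) + \sum_{j=1}^{n-1} \mathrm{conn}(v_j),$$

where the connector term applies only when processing character-based input. This is a plausible formalization, not a formula recited in the original text.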

[0079] Lexical cost information may be stored in a database, or it may be divided, with one portion being stored along with lexical grammar rules 310 and another portion being stored with POS information in lexical dictionary 316. In alternate embodiments, a variety of means for storing the lexical cost information may be used.

[0080] Based on the costs assigned to each path, segmentation and POS engine 320 selects the best paths within the connection graph, i.e., those that have the lowest costs. The best paths are used to generate output graph 322, which is provided to syntactic analysis module 216 of FIG. 2a for further processing. Output graph 322 contains the information needed by syntactic analysis module 216 for making an accurate final interpretation of the input sentence. In addition, the operation of syntactic analysis module 216 is simplified because only pertinent information (i.e., lexical feature structures for best paths as opposed to all possible paths) is passed to syntactic analysis module 216. Thus, the present invention may provide an accurate response to an input sentence without consuming an unreasonable amount of memory and processing time.

[0081] FIG. 4 is a flow diagram of one embodiment for reducing lexical ambiguity in a natural language translation system. Initially, at processing block 404, an input stream is passed to lexical ambiguity reduction module 300 of FIG. 3. The input stream may be, for example, a full sentence, a reduced sentence, a word, a number, a symbol, or a punctuation character. At processing block 406, the input stream is broken into tokens. In one embodiment, the input stream is broken into at least two tokens. The number of tokens varies depending upon the language, length and complexity of the input stream, and applicable tokenization rules 304, as described above.

[0082] At processing block 408, the tokens are used to create a connection graph. The connection graph may be created by finding all possible connections between tokens (i.e., performing segmentation of the input stream). The process of segmentation is described in more detail below. Regardless of whether the input stream requires segmentation, POS tagging and/or multiword processing may need to be performed. As described above, POS tagging involves finding all possible POS tags for each word in the input stream by consulting a lexical dictionary.

[0083] Multiword processing involves defining all possible multiword expressions in the input stream using a multiword dictionary. The multiword dictionary comprises multiword expressions (“multiwords”) in the format of feature structures. Consider the words “White House” in the sentence “I want to visit the White House.” Valid feature structures for the combination may include:

$$white \mapsto \begin{bmatrix}\text{root : white}\\ \text{cat : adj}\end{bmatrix} \qquad (\text{Feature Structure 4})$$

$$house \mapsto \begin{bmatrix}\begin{bmatrix}\text{root : house}\\ \text{cat : noun}\end{bmatrix}\\ \text{OR}\\ \begin{bmatrix}\text{root : house}\\ \text{cat : verb}\end{bmatrix}\end{bmatrix} \qquad (\text{Feature Structure 5})$$

[0084] An equally valid feature structure for the combination may be:

$$White\ House \mapsto \begin{bmatrix}\text{root : White House}\\ \text{cat : proper noun}\end{bmatrix} \qquad (\text{Feature Structure 6})$$

[0085] If Feature Structure 4 and Feature Structure 5 are found together in the multiword dictionary, then the combination “White House” is defined as a multiword and Feature Structure 6 is retrieved.

[0086] Referring again to processing block 408 of FIG. 4, the connection graph comprises a set of nodes and a set of arcs. A node corresponds to a separator between two words. An arc corresponds to a token and connects two nodes. An arc may be labeled with a corresponding part-of-speech tag. An example of a connection graph is shown in FIG. 5a. After the connection graph with the plurality of paths is created, each of the plurality of paths is assigned a cost, as shown in processing block 410. Each arc comprising the path has a cost associated with it. When processing character-based languages, e.g., the Japanese language, each node may also have a cost associated with it. As described above, these costs may be obtained from lexical cost file 318 and may include, for example, lexical costs, unigram costs, bigram costs, and connector costs. In one embodiment, the cost assigned to each path results from summing all costs defined for every arc and, if applicable, every node in this path. The process of calculating the cost for each path is described in more detail below.
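One illustrative way to hold such a graph in memory is an adjacency map from nodes to outgoing arcs, with optional per-node connector costs. These classes are hypothetical; the disclosure does not specify a concrete representation:

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    start: int    # node (separator) preceding the token
    end: int      # node (separator) following the token
    word: str
    pos: str      # part-of-speech tag labeling the arc
    cost: float   # per-arc cost (e.g., lexical + unigram); values illustrative

@dataclass
class ConnectionGraph:
    arcs_from: dict = field(default_factory=dict)  # node -> outgoing arcs
    node_cost: dict = field(default_factory=dict)  # connector costs (character-based input)

    def add_arc(self, arc):
        self.arcs_from.setdefault(arc.start, []).append(arc)

# The word "visit" between nodes 8 and 10 carries one arc per POS reading:
graph = ConnectionGraph()
graph.add_arc(Arc(8, 10, "visit", "v", cost=1.0))  # costs are made up
graph.add_arc(Arc(8, 10, "visit", "n", cost=2.5))
```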

[0087] At processing block 412, at least one best path is selected from the plurality of paths based upon a corresponding cost. In one embodiment, the costs of all possible paths are weighed and those with lower costs are selected to generate an output graph. The selection of paths is described in more detail below. At processing block 414, the output graph comprising the best paths is provided to syntactic analysis module 216 for further processing. In the examples described, selection of the best paths reduces lexical ambiguity in the input stream before the syntactic analysis module 216 begins its parsing operation, thereby simplifying the parsing process. In one embodiment, lexical ambiguity reduction module 210 provides syntactic analysis module 216 with all the information it may need for producing an accurate interpretation of the input stream.

[0088] FIG. 5a illustrates an exemplary connection graph for the input expression “I want to visit the White House.” Specifically, each pair of nodes 2 through 16 is connected by one or more of arcs 22 through 42. Arcs 22 through 42 are labeled with corresponding part-of-speech tags. For example, the word “visit” 50 is separated by nodes 8 and 10. Because the word “visit” 50 may have at least two part-of-speech tags, such as, for example, a verb and a noun, nodes 8 and 10 are connected by at least two arcs. In the example, arc 30 corresponds to a verb (“v”) and arc 32 corresponds to a noun (“n”). The word “House” 54 is separated by nodes 14 and 16, which are connected by arc 38, representing a verb tag, and arc 40, representing a noun tag. In addition, the word “House” 54 is a part of a multiword “White House” 56, which is a proper noun, as shown in Feature Structure 6. As a result, arc 42 is created connecting nodes 12 and 16 to represent the multiword expression with a POS tag of a proper noun. Possible combinations of arcs and nodes define a plurality of paths in the connection graph. The number of possible paths may vary depending on how many arcs represent each word in the input stream. If each word in the input stream has only one arc representing it, then the connection graph comprises only one path. Typically, however, more than one path is defined in the connection graph. In the example, twelve different paths may be defined in the connection graph based on all possible combinations of the arcs and nodes. FIG. 5b illustrates one of the twelve possible paths of FIG. 5a. Referring to FIG. 5b, the exemplary path consists of the combination of arcs 22, 24, 28, 30, 34, and 42 and the corresponding nodes.

[0089] FIG. 6 is a flow diagram of one embodiment for segmenting an input stream. The segmentation process is used in character-based languages, e.g., the Japanese language, which do not have separators such as spaces between words. A task of the segmentation process is to recognize all the possible words (or segments) in the given input stream and find possible connections between adjacent words. Initially, at processing block 504, tokens are received. In one embodiment, at least two tokens are received. At processing block 508, lexical functions may be selected from a collection of lexical functions 314. In one embodiment, lexical functions 314 result from pre-compiling lexical grammar rules 310 using a GPL compiler. Lexical grammar rules 310 may be written in GPL and may define the connectivity relation of tokens. Lexical functions 314 may call feature structure library routines. As described above, the output of GPL compiler 312 may be in the form of directly executable code or may be in the form of standard computer programming language statements. Either approach provides a flexible method for developing grammar rules, which becomes especially important for rules defining the connectivity relation of tokens in character-based languages due to the large amount of data involved in representing these rules.

[0090] At processing block 512, segments are created from the tokens based upon the lexical functions and the lexical dictionary. The created segments define all possible segmentations of the input stream. The creation of the segments may include finding various combinations of the tokens and then determining all possible connections between these various combinations. That is, the lexical information retrieved from lexical dictionary 316 may be used to define which tokens may be combined. Based upon all possible combinations, a number of lexical items (segments) may be created, in which every lexical item results from combining one or more tokens of the input stream. Then, lexical dictionary 316 and the lexical functions may be used to determine which adjacent segments may be connected. The segments that have valid connections define all possible segmentations of the input stream.
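The "all possible segments" step can be sketched as a scan that proposes every contiguous token span matching a dictionary entry. The `dictionary` argument stands in for lexical dictionary 316 and is assumed, for simplicity, to support membership tests on surface strings; connectivity checking with the lexical functions would follow as a separate step:

```python
def possible_segments(tokens, dictionary, max_len=8):
    """Return (start, end, word) for every token span found in the dictionary.

    `tokens` is the tokenizer output (one character per token for Japanese);
    `max_len` bounds the span length for efficiency and is an assumption.
    """
    segments = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            candidate = "".join(tokens[start:end])
            if candidate in dictionary:
                segments.append((start, end, candidate))
    return segments
```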

[0091] At processing block 514, a connection graph is generated from these segments. The connection graph represents all possible segmentations of the input stream and is subsequently processed by segmentation and POS engine 320 to generate an output graph. In one embodiment, the time-consuming segmentation process may be performed efficiently, thereby improving the overall performance of the translation system.

[0092] FIG. 7 is a flow diagram of one embodiment for reducing lexical ambiguity in an input English sentence. At processing block 704, tokenization of an input English sentence is performed by breaking the input English sentence into tokens. The number of tokens resulting from tokenizing an English sentence varies depending on the length and complexity of the sentence. At processing block 706, a connection graph is created using the tokens. As described above, the connection graph comprises a set of nodes and a set of arcs.

[0093] At processing block 708, all possible POS tags are defined for each word in the sentence by consulting lexical dictionary 316. In one embodiment, each word in the sentence comprises at least one token. When more than one POS tag is found in the lexical dictionary for a word, an arc is added to the connection graph to represent every additional POS tag found. Every arc is labeled with a corresponding POS tag. The FIG. 5a example shows all the arcs defined for every word in the input sentence “I want to visit the White House.” The elements of the connection graph are described in more detail above.

[0094] Referring to FIG. 7, at processing block 710, multiword expressions are defined by consulting a multiword dictionary. As described above, an arc is added to define each multiword in the sentence. Based upon all possible POS tags and multiwords in the sentence, a plurality of paths is defined in the connection graph. Each path represents a combination of arcs and nodes in the connection graph. In the example shown in FIG. 5a, twelve different paths may be defined in the connection graph based on all possible combinations of the arcs and nodes. The FIG. 5b example illustrates one of the twelve possible paths, which consists of the combination of arcs 22, 24, 28, 30, 34, and 42 and the corresponding nodes.

[0095] Referring to FIG. 7, at processing block 714, each path in the connection graph is assigned a cost. This cost is the total of the costs calculated for all the arcs contained in the path. In one embodiment, the cost calculated for each arc in the path includes a lexical cost, a POS (or unigram) cost, and a bigram cost. In the example shown in FIG. 5a, the lexical cost assigned to arc 24 may be lower than the cost assigned to arc 26 because the word “want” may be used more often as a verb than as a noun. The unigram cost, or POS cost, corresponds to the probability of observing this particular part of speech, regardless of what the word is or what the surrounding parts of speech are. For example, the unigram cost assigned to arc 24 may be higher than the unigram cost assigned to arc 26 because nouns in general may be considered to be used more often than verbs. The bigram cost corresponds to the probability of observing a sequence of two particular parts of speech, regardless of what the words are. The bigram cost is assigned to each pair of connected arcs. For example, the bigram cost assigned to the combination of arcs 22 and 24 may be lower than the bigram cost assigned to the combination of arcs 22 and 26 because the sequence of a pronoun and a verb may be more probable than the sequence of a pronoun and a noun. Thus, the total cost assigned to each path includes the lexical costs assigned to each arc in the path, the unigram costs assigned to each arc in the path, and the bigram costs assigned to each pair of connected arcs in the path. In one embodiment, when a path comprises an arc defining a multiword expression (e.g., arc 42 in FIG. 5a or 5b), the cost for this arc is derived from the multiword entry in the multiword dictionary.
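A per-path total consistent with this paragraph might be computed as follows; the three cost tables stand in for the contents of lexical cost file 318 and are assumptions, as is the (word, tag) arc representation:

```python
def path_cost(arcs, lexical_cost, unigram_cost, bigram_cost):
    """Sum lexical and unigram costs per arc, plus a bigram cost for
    each pair of connected arcs, as described above."""
    total = 0.0
    for word, tag in arcs:
        total += lexical_cost[(word, tag)] + unigram_cost[tag]
    for (_, left), (_, right) in zip(arcs, arcs[1:]):
        total += bigram_cost[(left, right)]
    return total
```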

[0096] Referring to FIG. 7, at processing block 716, the n best paths are selected from all the paths in the connection graph. The selection is based upon the cost assigned to each path. The number (“n”) of best paths selected may be predefined based upon a variety of factors, such as, for example, a desired level of accuracy, the complexity of the information being processed, or time constraints associated with the process. In an alternate embodiment, the number of best paths may be determined by segmentation and POS engine 320 during operation based upon various factors. In another embodiment, the number of best paths may be varied depending upon a certain percentage defined to limit the costs of the selected best paths. For example, this percentage may be set to 20%; then only the paths with costs not exceeding the lowest path cost by more than twenty percent may be selected as best paths. Thus, if path 1 has a cost of 10, path 2 has a cost of 11.8, path 3 has a cost of 12.2, and path 4 has a cost of 14, only paths 1 and 2 are selected as best paths because the costs of paths 3 and 4 exceed the cost of path 1 by more than 20%. In alternate embodiments, a variety of methods for determining the number of best paths may be used. The selected n best paths are then used to generate an output graph 718 as described above.
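The percentage rule in this paragraph reduces to a simple cutoff over the path costs. The sketch below mirrors the 20% example; the dictionary of path costs is illustrative:

```python
def best_paths(path_costs, threshold=0.20):
    """Keep every path whose cost exceeds the lowest cost by at most
    the given fraction (here 20%, matching the example above)."""
    lowest = min(path_costs.values())
    cutoff = lowest * (1 + threshold)
    return [path for path, cost in path_costs.items() if cost <= cutoff]

print(best_paths({"path 1": 10, "path 2": 11.8, "path 3": 12.2, "path 4": 14}))
# ['path 1', 'path 2'] -- paths 3 and 4 exceed the lowest cost (10) by more than 20%
```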

[0097] FIG. 8 is a flow diagram of one embodiment for reducing lexical ambiguity in an input Japanese sentence. At processing block 804, tokenization of an input Japanese sentence is performed by breaking the input Japanese sentence into tokens. Because a typical Japanese sentence does not have separators such as spaces between words, each Japanese character in the sentence may correspond to a token. At processing block 806, the tokens are combined in all possible combinations to define a variety of lexical entries (segments) in the sentence using lexical dictionary 316. FIG. 9 illustrates an exemplary connection of tokens in an input Japanese sentence. Referring to FIG. 9, tokens 50 through 80 are combined in various ways. Combinations of tokens are made to match any entry in lexical dictionary 316. For example, token 50 by itself may have a matching lexical entry in lexical dictionary 316, or a combination of tokens 50 and 52 may have a matching lexical entry in lexical dictionary 316. All combinations that have matching entries in lexical dictionary 316 are analyzed to define all the possible lexical entries (segments) in the input sentence. For example, the combination of tokens 58 and 60 may define segment 20. In addition, the combination of the same tokens 58 and 60 along with token 62 may result in segment 22. Furthermore, the combination of tokens 58 and 60 may be a part of segment 24.

[0098] Referring to FIG. 8, at processing block 808, the variety of segments are connected using lexical dictionary 316 and the lexical functions to define possible segmentations of the input sentence. As described above, in one embodiment, the lexical functions are associated with the segments being processed and are selected from the entire collection of lexical functions 314. Lexical functions 314 result from compiling lexical grammar rules using GPL compiler 312. Selected lexical functions define the connectivity relation between lexical feature structures of the input sentence. In one embodiment, based upon lexical functions 314 and lexical dictionary 316, all possible connections for each lexical feature structure may be defined using features LEX-TO and LEX-FROM assigned to the lexical feature structures of the input sentence. For every lexical feature structure, the features LEX-TO and LEX-FROM may define all possible parts of speech that can be connected to this segment. That is, the feature LEX-FROM may define all parts of speech that may precede this segment, and the feature LEX-TO may define all parts of speech that can immediately follow this segment. If any value in the LEX-TO and LEX-FROM features of adjacent segments matches, then these two segments may be connected. As shown in FIG. 9, lexical feature structure 26 defined by arc 5 has a value “noun-part” in its LEX-TO feature. The same value is contained in a LEX-FROM feature of lexical feature structure 28 defined by arc 7. Thus, a valid connection can be made between these two adjacent segments. Each segment may be connected to more than one preceding segment and to more than one following segment. For example, segment 24 may be connected to at least two preceding segments (e.g., segments 30 and 32) if any value in its LEX-FROM feature matches any value in the LEX-TO feature of each preceding segment. In addition, segment 24 may be connected to at least two following segments (e.g., segments 26 and 34) if any value in its LEX-TO feature matches any value in the LEX-FROM feature of each following segment. In one embodiment, segments that have neither preceding nor following connections are ignored. The rest of the segments may be used to define all possible segmentations of the input sentence.
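The LEX-TO/LEX-FROM test described above amounts to a set intersection between adjacent segments. The representation below (dicts of feature-value sets, and the sample values other than “noun-part”) is a sketch, not the actual feature structure layout:

```python
def can_connect(left_segment, right_segment):
    """Adjacent segments connect if any value in the left segment's LEX-TO
    feature matches any value in the right segment's LEX-FROM feature."""
    return bool(left_segment["LEX-TO"] & right_segment["LEX-FROM"])

arc5 = {"LEX-FROM": {"verb-stem"}, "LEX-TO": {"noun-part"}}  # "verb-stem" is made up
arc7 = {"LEX-FROM": {"noun-part"}, "LEX-TO": {"eos"}}        # "eos" is made up

print(can_connect(arc5, arc7))  # True -- "noun-part" appears on both sides
```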

[0099] Referring to FIG. 8, at processing block 810, each segment is assigned all POS tags found for this lexical entry in lexical dictionary 316. FIG. 9 shows sample POS tags assigned to arcs 1 through 11.

[0100] Referring to FIG. 8, at processing block 812, the segments and corresponding POS tags may be used to create a connection graph. In one embodiment, the process of creating a connection graph for a Japanese sentence may be the same as the process of creating a connection graph for an English sentence. As described above, the connection graph comprises a set of nodes and a set of arcs. Each arc corresponds to a POS tag of a segment. Various combinations of arcs and nodes define a plurality of paths in the connection graph.
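
One plausible in-memory shape for such a graph (an illustrative sketch, not the structure used by the system): the nodes are the character boundaries of the sentence, and each segment contributes one arc per POS tag found for it in the dictionary:

    from collections import defaultdict

    def build_connection_graph(segments, pos_tags):
        """Map each start-boundary node to its outgoing arcs.

        segments: iterable of (start, end, surface) tuples
        pos_tags: hypothetical dict mapping a surface form to its POS tags
        Each (segment, tag) pair becomes one arc, so a segment with two
        tags yields two arcs between the same pair of nodes.
        """
        graph = defaultdict(list)
        for start, end, surface in segments:
            for tag in pos_tags.get(surface, []):
                graph[start].append((end, surface, tag))
        return graph

    segments = [(0, 2, "日本"), (0, 3, "日本語"), (2, 3, "語"), (3, 4, "を"), (4, 6, "話す")]
    pos_tags = {"日本": ["noun"], "日本語": ["noun"], "語": ["noun", "suffix"],
                "を": ["particle"], "話す": ["verb"]}
    graph = build_connection_graph(segments, pos_tags)
    # graph[0] == [(2, '日本', 'noun'), (3, '日本語', 'noun')], and so on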

[0101] At processing block 814, each path is assigned a cost. As described above, this cost is a total of all costs calculated for every arc and node contained in the path. In a Japanese sentence, the cost calculated for every arc in the path may include a lexical cost and a POS (or unigram) cost. In addition, the segmentation process may involve a connector cost, which is assigned to each node in the path. The connector cost corresponds to the probability of observing two types of words in adjacent positions. That is, each of all possible connections made between adjacent segments may carry a connector cost associated with this particular connection. Thus, in one embodiment, the cost assigned to each path may include lexical costs assigned to each arc in the path, unigram costs assigned to each arc in the graph, and connector costs assigned to each node in the graph. In alternate embodiments, any other way of calculating a cost for a path may be used.
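
The cost arithmetic can be sketched as follows; the cost tables and the purely additive combination are assumptions for illustration, since the description only requires that per-arc costs and per-node connector costs sum along the path:

    def path_cost(arcs, lexical_cost, unigram_cost, connector_cost):
        """Total cost of one path through the connection graph.

        arcs: sequence of (surface, tag) pairs along the path
        lexical_cost / unigram_cost: hypothetical dicts keyed by surface / tag
        connector_cost: hypothetical dict keyed by (left_tag, right_tag),
                        charged at each node joining two adjacent arcs
        """
        total = 0.0
        for surface, tag in arcs:
            total += lexical_cost.get(surface, 0.0) + unigram_cost.get(tag, 0.0)
        for (_, left_tag), (_, right_tag) in zip(arcs, arcs[1:]):
            total += connector_cost.get((left_tag, right_tag), 0.0)
        return total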

[0102] At processing block 816, n best paths are selected from all the paths within the connection graph. In one embodiment, the selection is based upon the cost assigned to each path. The number of best paths is determined as described above. The selected n best paths are used to generate an output graph, which is passed to syntactic analysis module 216.
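
Selecting the n best paths can then be sketched as a walk over the graph that keeps the n cheapest complete paths. The brute-force enumeration below favors clarity over efficiency (a production system would more likely use a dynamic-programming n-best search) and reuses the graph and cost shapes of the earlier sketches:

    import heapq

    def n_best_paths(graph, start, goal, cost_fn, n):
        """Enumerate all start-to-goal paths in an acyclic connection graph
        and return the n with the lowest total cost (brute force)."""
        completed = []

        def walk(node, arcs_so_far):
            if node == goal:
                completed.append((cost_fn(arcs_so_far), list(arcs_so_far)))
                return
            for end, surface, tag in graph.get(node, []):
                arcs_so_far.append((surface, tag))
                walk(end, arcs_so_far)
                arcs_so_far.pop()

        walk(start, [])
        return heapq.nsmallest(n, completed, key=lambda item: item[0])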

[0103] A method and system for reducing lexical ambiguity in an input stream have been described. The method breaks the input stream into tokens and creates a connection graph using the tokens. If needed, the method may perform segmentation of the input stream, POS tagging, or multiword processing. Results received in any of the above processes are used to define a plurality of paths in the connection graph. The method assigns a cost to each of the plurality of paths. Based upon the assigned costs, at least one best path is selected from the plurality of paths. The method uses the at least one best path to generate an output graph. The output graph is passed to a syntactic analysis module to reduce lexical ambiguity. With the present invention, an efficient way of reducing lexical ambiguity is provided which produces an accurate interpretation of the input stream without unreasonably burdening the operation of the syntactic analysis module.

[0104] Several variations in the implementation of the method for reducing lexical ambiguity have been described. The specific arrangements and methods described here are illustrative of the principles of this invention. Numerous modifications in form and detail may be made by those skilled in the art without departing from the true spirit and scope of the invention. Although this invention has been shown in relation to a particular embodiment, it should not be considered so limited. Rather, it is limited only by the appended claims.

What is claimed is:
1. A method for reducing lexical ambiguity in an input stream, comprising: breaking the input stream into at least two tokens; creating a connection graph using the at least two tokens, the connection graph comprising a plurality of paths; assigning a cost to each of the plurality of paths; defining at least one best path based upon a corresponding cost to generate an output graph; and providing the output graph to a syntactic analysis module to reduce lexical ambiguity.
2. The method of claim 1 wherein a number of the at least one best path is either predefined or determined programmatically.
3. The method of claim 1 wherein creating a connection graph using the at least two tokens comprises: compiling lexical grammar rules to generate lexical functions, the lexical grammar rules being written in a grammar programming language; creating a plurality of segments from the at least two tokens based upon lexical information and the lexical functions; and defining the plurality of paths using the plurality of segments.
4. The method of claim 1 wherein creating a connection graph using the at least two tokens comprises assigning at least one part of speech tag to at least one of the at least two tokens using lexical information.
5. The method of claim 1 wherein creating a connection graph using the at least two tokens comprises recognizing a multiword expression in the input stream using multiword information.
6. The method of claim 1 wherein the connection graph comprises a set of nodes and a set of arcs.
7. The method of claim 6 wherein each of the plurality of paths comprises a combination of nodes and arcs.
8. The method of claim 1 wherein the cost comprises lexical cost, unigram cost, bigram cost and connector cost.
9. A method for providing segmentation of an input stream having at least two tokens, comprising: creating a plurality of segments from the at least two tokens based upon lexical information and lexical functions; and generating a connection graph using the plurality of segments.
10. The method of claim 9 further comprising compiling lexical grammar rules to generate the lexical functions, the lexical grammar rules being written in a grammar programming language.
11. The method of claim 10 wherein the lexical grammar rules define connectivity relation of tokens.
12. The method of claim 9 further comprising assigning at least one part of speech tag to at least one segment using a lexical dictionary.
13. The method of claim 12 further comprising: defining a plurality of paths in the connection graph based upon part of speech tags and the segments; assigning a cost to each of the plurality of paths; and determining at least one best path based upon a corresponding cost to generate an output graph.
14. An apparatus for reducing lexical ambiguity in an input stream, comprising: means for breaking the input stream into at least two tokens; means for creating a connection graph using the at least two tokens, the connection graph comprising a plurality of paths; means for assigning a cost to each of the plurality of paths; means for defining at least one best path based upon a corresponding cost to generate an output graph; and means for providing the output graph to a syntactic analysis module to reduce lexical ambiguity.
15. The apparatus of claim 14 wherein a number of the at least one best path is either predefined or determined programmatically.
16. The apparatus of claim 14 further comprising: means for compiling lexical grammar rules to generate lexical functions, the lexical grammar rules being written in a grammar programming language; means for creating a plurality of segments from the at least two tokens based upon lexical information and the lexical functions; and means for defining the plurality of paths using the plurality of segments.
17. The apparatus of claim 14 further comprising means for assigning at least one part of speech tag to at least one of the at least two tokens using lexical information.
18. The apparatus of claim 14 further comprising means for recognizing a multiword expression in the input stream using multiword information.
19. The apparatus of claim 14 wherein the connection graph comprises a set of nodes and a set of arcs.
20. The apparatus of claim 19 wherein each of the plurality of paths comprises a combination of nodes and arcs.
21. The apparatus of claim 14 wherein the cost comprises lexical cost, unigram cost, bigram cost and connector cost.
22. An apparatus for providing segmentation of an input stream having at least two tokens, comprising: means for creating a plurality of segments from the at least two tokens based upon lexical information and lexical functions; and means for generating a connection graph using the plurality of segments.
23. The apparatus of claim 22 further comprising means for compiling lexical grammar rules to generate the lexical functions, the lexical grammar rules being written in a grammar programming language.
24. The apparatus of claim 23 wherein the lexical grammar rules define connectivity relation of tokens.
25. The apparatus of claim 22 further comprising means for assigning at least one part of speech tag to at least one segment using a lexical dictionary.
26. The apparatus of claim 25 further comprising: means for defining a plurality of paths in the connection graph based upon part of speech tags and the segments; means for assigning a cost to each of the plurality of paths; and means for determining at least one best path based upon a corresponding cost to generate an output graph.
27. An apparatus for reducing lexical ambiguity in an input stream, comprising: a tokenizer for breaking the input stream into at least two tokens; a token connector for creating a connection graph using the at least two tokens, the connection graph comprising a plurality of paths; a cost assignor for assigning a cost to each of the plurality of paths; a path calculator for defining at least one best path based upon a corresponding cost to generate an output graph; and a graph provider for providing the output graph to a syntactic analysis module to reduce lexical ambiguity.
28. The apparatus of claim 27 wherein a number of the at least one best path is either predefined or determined programmatically.
29. The apparatus of claim 27 wherein the token connector comprises: a grammar programming language (GPL) compiler for compiling lexical grammar rules to generate lexical functions, the lexical grammar rules being written in a grammar programming language; a segmentation engine for creating a plurality of segments from the at least two tokens based upon lexical information and the lexical functions; and a path designator for defining the plurality of paths using the plurality of segments.
30. The apparatus of claim 27 wherein the token connector comprises a part of speech tagger for assigning at least one part of speech tag to at least one of the at least two tokens using lexical information.
31. The apparatus of claim 27 wherein the token connector comprises a multiword recognizer for recognizing a multiword expression in the input stream using multiword information.
32. The apparatus of claim 27 wherein the connection graph comprises a set of nodes and a set of arcs.
33. The apparatus of claim 32 wherein each of the plurality of paths comprises a combination of nodes and arcs.
34. The apparatus of claim 27 wherein the cost comprises lexical cost, unigram cost, bigram cost and connector cost.
35. An apparatus for providing segmentation of an input stream having at least two tokens, comprising: a segmentation engine for creating a plurality of segments from the at least two tokens based upon lexical information and lexical functions; and a graph generator for generating a connection graph using the plurality of segments.
36. The apparatus of claim 35 further comprising a grammar programming language (GPL) compiler for compiling lexical grammar rules to generate the lexical functions, the lexical grammar rules being written in GPL.
37. The apparatus of claim 36 wherein the lexical grammar rules define connectivity relation of tokens.
38. The apparatus of claim 35 further comprising a part of speech tagger for assigning at least one part of speech tag to at least one segment using lexical information.
39. The apparatus of claim 38 further comprising: a path designator for defining a plurality of paths in the connection graph based upon part of speech tags and the segments; a cost assignor for assigning a cost to each of the plurality of paths; and a path calculator for determining at least one best path based upon a corresponding cost to generate an output graph.
40. A system for reducing lexical ambiguity, comprising: a processor; an input coupled to the processor, the input capable of receiving an input stream, the processor configured to break the input stream into at least two tokens, create a connection graph comprising a plurality of paths using the at least two tokens, assign a cost to each of the plurality of paths, and define at least one best path based upon a corresponding cost to generate an output graph; and an output coupled to the processor, the output capable of providing the output graph to a syntactic analysis module to reduce lexical ambiguity.
41. A system for providing segmentation of an input stream, comprising: a processor; an input coupled to the processor, the input capable of receiving an input stream having at least two tokens, the processor configured to create a plurality of segments from the at least two tokens based upon lexical information and lexical functions, and generate a connection graph using the plurality of segments; and an output coupled to the processor, the output capable of providing segmentation of the input stream.
42. A computer readable medium comprising instructions, which when executed on a processor, perform a method for reducing lexical ambiguity in an input stream, comprising: breaking an input stream into at least two tokens; creating a connection graph using the at least two tokens, the connection graph comprising a plurality of paths; assigning a cost to each of the plurality of paths; defining at least one best path based upon a corresponding cost to generate an output graph; and providing the output graph to a syntactic analysis module to reduce lexical ambiguity.
43. The computer readable medium of claim 42 wherein creating a connection graph further comprises providing segmentation of the input stream using lexical information and lexical functions.
44. The computer readable medium of claim 42 wherein creating a connection graph further comprises assigning at least one part of speech tag to at least one of the at least two tokens using lexical information.
45. The computer readable medium of claim 42 wherein creating a connection graph further comprises recognizing a multiword expression in the input stream using lexical information.
46. The computer readable medium of claim 42 wherein a number of the at least one best path is either predefined or determined programmatically.
47. A computer readable medium comprising instructions, which when executed on a processor, perform a method for providing segmentation of an input stream having at least two tokens, comprising: creating a plurality of segments from the at least two tokens based upon lexical information and lexical functions; and generating a connection graph using the plurality of segments.
48. The computer readable medium of claim 47 further comprising compiling lexical grammar rules to generate the lexical functions, the lexical grammar rules being written in a grammar programming language.
49. A memory for storing data for access by an application program being executed on a data processing system, comprising: a data structure stored in said memory, said data structure including information resident in a file used by said application program and including: a plurality of packet structures used for the transmission of data, wherein each packet structure includes a set of nodes, a set of arcs connecting at least two of the set of nodes, and a value data object for each of the set of arcs having a value that represents a corresponding part of speech tag.